Given two data points $\mathbf{x} = (x_1, x_2, \ldots, x_d)$ and $\mathbf{y} = (y_1, y_2, \ldots, y_d)$, the Euclidean distance between them is defined as below,

\[
D(\mathbf{x}, \mathbf{y}) = \frac{1}{d}\sum_{i=1}^{d}(x_i - y_i)^2
\tag{2.15}
\]
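As a minimal sketch of Eq. (2.15) under the reconstruction above (the factor $1/d$ averages the squared coordinate differences; the function name euclidean_distance and the use of NumPy are illustrative assumptions, not part of the original text), the distance can be computed as follows.

\begin{verbatim}
import numpy as np

def euclidean_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Averaged squared-difference distance, following Eq. (2.15)."""
    d = x.shape[0]
    return float(np.sum((x - y) ** 2) / d)

# Example: two 3-dimensional data points
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])
print(euclidean_distance(x, y))  # (1 + 4 + 0) / 3 = 1.666...
\end{verbatim}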

Hamming distance [Hamming, 1950] is commonly used for binary data and it is defined as below,

\[
D(\mathbf{x}, \mathbf{y}) = \frac{1}{d}\sum_{i=1}^{d}|x_i - y_i|
\tag{2.16}
\]
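A corresponding sketch for Eq. (2.16), assuming binary-valued vectors so that the mean absolute difference equals the fraction of mismatching positions; the function name is again an illustrative choice.

\begin{verbatim}
import numpy as np

def hamming_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Mean absolute difference, following Eq. (2.16)."""
    d = x.shape[0]
    return float(np.sum(np.abs(x - y)) / d)

# For binary vectors this is the fraction of positions that disagree
x = np.array([1, 0, 1, 1])
y = np.array([1, 1, 1, 0])
print(hamming_distance(x, y))  # 2 mismatches out of 4 positions -> 0.5
\end{verbatim}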

A distance measures the dissimilarity. A correlation coefficient can be used to measure how similar two data points are to each other. The definition of the correlation coefficient for two vectors (hence two data points) $\mathbf{x}$ and $\mathbf{y}$ is shown below, where $\mu_{\mathbf{x}}$ and $\mu_{\mathbf{y}}$ stand for the population means of $\mathbf{x}$ and $\mathbf{y}$, and $\sigma_{\mathbf{x}}$ and $\sigma_{\mathbf{y}}$ stand for the variances of the two populations,

\[
\rho(\mathbf{x}, \mathbf{y}) = \frac{\sum_{i=1}^{d}(x_i - \mu_{\mathbf{x}})(y_i - \mu_{\mathbf{y}})}{(d-1)\sqrt{\sigma_{\mathbf{x}}\sigma_{\mathbf{y}}}}
\tag{2.17}
\]
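A sketch of Eq. (2.17) as reconstructed above. Here the means and variances are estimated from the two vectors themselves, which is an assumption, since the text refers to population quantities; the function name is illustrative.

\begin{verbatim}
import numpy as np

def correlation(x: np.ndarray, y: np.ndarray) -> float:
    """Correlation coefficient following Eq. (2.17)."""
    d = x.shape[0]
    mu_x, mu_y = x.mean(), y.mean()
    # Variances estimated from the vectors themselves (assumption)
    var_x = np.sum((x - mu_x) ** 2) / (d - 1)
    var_y = np.sum((y - mu_y) ** 2) / (d - 1)
    numerator = np.sum((x - mu_x) * (y - mu_y))
    return float(numerator / ((d - 1) * np.sqrt(var_x * var_y)))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(correlation(x, y))  # perfectly correlated -> 1.0
\end{verbatim}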

The partitioning strategy and the grouping strategy are two major clustering strategies. For the former, a data space is partitioned into a number of subspaces, each of which is a cluster. Each subspace is characterised by a cluster centre. Sometimes, the variance or covariance of a cluster is also considered as a parameter to describe a subspace. Each data point is labelled by minimising its distance to the cluster centres of a cluster model. Most clustering algorithms employing the partitioning strategy are parameterised models. This means that the knowledge of how data points are grouped or clustered can be saved in a set of model parameters such as the cluster centres and the variances. The saved model parameters can be used for inference on novel data. The K-means algorithm is a typical example of such an algorithm.
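To illustrate the partitioning strategy described above, the sketch below labels each point by its nearest cluster centre and re-estimates the centres, in the spirit of K-means; the random initialisation, the stopping rule, and the function name are simplifying assumptions rather than the author's specific formulation.

\begin{verbatim}
import numpy as np

def kmeans(data: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    """Minimal K-means sketch: the learned centres are the saved model parameters."""
    rng = np.random.default_rng(seed)
    centres = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Label each point by minimising its distance to the cluster centres
        dists = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-estimate each centre as the mean of the points assigned to it
        new_centres = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels

# The saved centres can be used for inference on novel data
data = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centres, labels = kmeans(data, k=2)
novel = np.array([[4.8, 5.2]])
print(np.linalg.norm(novel[:, None, :] - centres[None, :, :], axis=2).argmin(axis=1))
\end{verbatim}

The final line assigns a novel point to its nearest saved centre, which corresponds to the inference on novel data mentioned above.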